About the dataset¶

Name: This column contains the full name of the athlete participating in the Olympic Games.
Sex: This column indicates the gender of the athlete. It has two unique values: "M" for male and "F" for female.
Age: This column represents the age of the athlete at the time of the competition.
Team: The name of the team or delegation that the athlete represents in Olympics.
NOC: It contains the three-letter country code assigned by the National Olympic Committee (NOC).
Year: Represents the year in which the Olympic Games took place.
Season: Indicates whether the Olympic Games occurred in the "Summer" or "Winter" season. This distinction is important because different sports are played in each season.
City: The host city where the Olympic event took place. This information can be useful for analyzing the impact of location and climate conditions on athlete performance.
Sport: Represents the broad category of the sport in which the athlete competed (e.g., Athletics, Swimming, Gymnastics).
Event: The specific event within a sport in which the athlete participated (e.g., "100m Sprint", "Long Jump").
Medal: Indicates the type of medal won by the athlete. Possible values include "Gold", "Silver", "Bronze", or "NaN" if no medal was won.
Country: This column represents the full country name corresponding to the NOC code.
Height: The height of the athlete is in centimetres.
Weight: The weight of the athlete in kilograms.

Import the Libraries and Load the Data¶

In [1]:
#Import the Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import OneHotEncoder
Requirement already satisfied: xgboost in c:\users\user5\anaconda3\lib\site-packages (2.1.4)
Requirement already satisfied: numpy in c:\users\user5\anaconda3\lib\site-packages (from xgboost) (1.24.3)
Requirement already satisfied: scipy in c:\users\user5\anaconda3\lib\site-packages (from xgboost) (1.11.1)
In [2]:
#Loading the file
df = pd.read_csv('Athletes_summer_games.csv')
df_athlete = pd.read_csv('Olympic_Athlete_Biography.csv')
df_region = pd.read_csv('Olympic_Country_Profiles.csv')
In [3]:
df.sample(1)
Out[3]:
Unnamed: 0 Name Sex Age Team NOC Games Year Season City Sport Event Medal
95483 115751 Mariappa Kempaiah M 23.0 India IND 1956 Summer 1956 Summer Melbourne Football Football Men's Football NaN
In [4]:
# Getting all column name in same format
df_region.rename(columns={'noc':'NOC','country':'Country'},inplace = True)
In [5]:
df_region.sample(1)
Out[5]:
NOC Country
39 CHA Chad
In [6]:
# Merging Country in main dataset
df = df.merge(df_region,on='NOC',how = 'left')
In [7]:
# dropping unnecessary column
df = df.drop(columns=['Unnamed: 0','Games'])
In [8]:
df.sample(1)
Out[8]:
Name Sex Age Team NOC Year Season City Sport Event Medal Country
169486 Alberto Ruiz Benito M 30.0 Spain ESP 1992 Summer Barcelona Athletics Athletics Men's Pole Vault NaN Spain
In [9]:
# dropping unnecessary column and Getting all column name in same format
df_athlete = df_athlete.drop(columns=['athlete_id','sex','born','country','country_noc','description','special_notes'])
df_athlete.rename(columns={'name':'Name','height':'Height','weight':'Weight'},inplace=True)
df_athlete.sample(1)
Out[9]:
Name Height Weight
12502 Ellie Faulkner 165.0 68.0
In [10]:
# Merging Height & Weight in main dataset
df = df.merge(df_athlete,on='Name',how='left')
In [11]:
df.sample(3)
Out[11]:
Name Sex Age Team NOC Year Season City Sport Event Medal Country Height Weight
41641 Natasha De'Anka "Tasha" Danvers (-Smith-) F 22.0 Great Britain GBR 2000 Summer Sydney Athletics Athletics Women's 4 x 400 metres Relay NaN Great Britain NaN NaN
137962 Ibtihaj Muhammad F 30.0 United States USA 2016 Summer Rio de Janeiro Fencing Fencing Women's Sabre, Team Bronze United States 170.0 68.0
226066 ABDEL LATIF Radwa F 31.0 Egypt EGY 2020 Summer Tokyo Shooting 10m Air Pistol Mixed Team NaN Egypt NaN NaN

Understanding of the data¶

In [13]:
##Dimensions of the data
df.shape
Out[13]:
(241723, 14)
In [14]:
#Information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241723 entries, 0 to 241722
Data columns (total 14 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Name     241723 non-null  object 
 1   Sex      241723 non-null  object 
 2   Age      232357 non-null  float64
 3   Team     241723 non-null  object 
 4   NOC      241723 non-null  object 
 5   Year     241723 non-null  int64  
 6   Season   241723 non-null  object 
 7   City     241723 non-null  object 
 8   Sport    241723 non-null  object 
 9   Event    241723 non-null  object 
 10  Medal    37196 non-null   object 
 11  Country  241469 non-null  object 
 12  Height   68601 non-null   float64
 13  Weight   68601 non-null   float64
dtypes: float64(3), int64(1), object(10)
memory usage: 25.8+ MB
In [15]:
# Converted all country values into string
df['Country'] = df['Country'].astype(str)
In [16]:
#Information about the dataset 
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 241723 entries, 0 to 241722
Data columns (total 14 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   Name     241723 non-null  object 
 1   Sex      241723 non-null  object 
 2   Age      232357 non-null  float64
 3   Team     241723 non-null  object 
 4   NOC      241723 non-null  object 
 5   Year     241723 non-null  int64  
 6   Season   241723 non-null  object 
 7   City     241723 non-null  object 
 8   Sport    241723 non-null  object 
 9   Event    241723 non-null  object 
 10  Medal    37196 non-null   object 
 11  Country  241723 non-null  object 
 12  Height   68601 non-null   float64
 13  Weight   68601 non-null   float64
dtypes: float64(3), int64(1), object(10)
memory usage: 25.8+ MB

Interpretation: The dataset consists of 2,41,723 rows representing Olympic athlete data over multiple years. It includes 14 columns with detailed information about athletes such as Name, Sex, Age, Team, Country, participated Sport with Event and Medal column represents whether he won the medal or not. Notably, Several columns like Age, Height, Weight and Medal have missing values.

In [17]:
# Missing values
df.isnull().sum()
Out[17]:
Name            0
Sex             0
Age          9366
Team            0
NOC             0
Year            0
Season          0
City            0
Sport           0
Event           0
Medal      204527
Country         0
Height     173122
Weight     173122
dtype: int64

Interpretation: The data displayed shows various columns related to personal information of athletes, countries participation in olympic games. Notably, Out of 4 columns 3 columns have higher missing values which indicates that data points are not recorded or unavailable in the dataset and Medal column have large portion of Nan values means most of the athletes did not win a medal so missing values could be interpretted as Nan. This could affect the analysis related to performance of athletes in the olympic games.

In [18]:
# checking duplicates values.
df.duplicated().sum()
Out[18]:
1765

Interpretation: In this dataset 1765 duplicate values which will impact the analysis related to performace of athletes in olympic games.

In [19]:
# Now there is no duplicate values.
df.drop_duplicates(inplace=True)
df.duplicated().sum()
Out[19]:
0
In [20]:
# Checking spcific values of medals 
df['Medal'].value_counts()
Out[20]:
Medal
Gold      12459
Bronze    12436
Silver    12240
Name: count, dtype: int64

Interpretition: The "Medal" column represents the count of different types of medals won in the dataset. The count for 3 categories are relatively close, indicating that the olympic medal distribution is fairly even across all three medal types. Gold medals have a slightly higher count than Silver and Bronze but the difference is small. This suggests that across all olympic events in the dataset, the number of gold medals awarded is slightly higher than the other two.

In [21]:
# applying One-Hot Encoding & concating it to main dataset
df = pd.concat([df,pd.get_dummies(df['Medal']).astype(int)],axis=1)
In [22]:
# Need to maintain consistancy in Country column 
Country_Mapping={'ROC':'Russian Olympic Committee'}
df['Country'] = df['Country'].replace(Country_Mapping)
In [23]:
df
Out[23]:
Name Sex Age Team NOC Year Season City Sport Event Medal Country Height Weight Bronze Gold Silver
0 A Dijiang M 24.0 China CHN 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN People's Republic of China NaN NaN 0 0 0
1 A Lamusi M 23.0 China CHN 2012 Summer London Judo Judo Men's Extra-Lightweight NaN People's Republic of China 170.0 60.0 0 0 0
2 Gunnar Nielsen Aaby M 24.0 Denmark DEN 1920 Summer Antwerpen Football Football Men's Football NaN Denmark NaN NaN 0 0 0
3 Edgar Lindenau Aabye M 34.0 Denmark/Sweden DEN 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold Denmark NaN NaN 0 1 0
4 Cornelia "Cor" Aalten (-Strannood) F 18.0 Netherlands NED 1932 Summer Los Angeles Athletics Athletics Women's 100 metres NaN Netherlands NaN NaN 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
241718 ZYKOVA Yulia F 25.0 Russia ROC 2020 Summer Tokyo Shooting 50m Rifle 3 Positions Women Silver Russian Olympic Committee NaN NaN 0 0 1
241719 ZYUZINA Ekaterina F 24.0 Russia ROC 2020 Summer Tokyo Sailing Women's One Person Dinghy - Laser Radial NaN Russian Olympic Committee NaN NaN 0 0 0
241720 ZYUZINA Ekaterina F 24.0 Russia ROC 2020 Summer Tokyo Sailing Women's One Person Dinghy - Laser Radial NaN Russian Olympic Committee NaN NaN 0 0 0
241721 ZYZANSKA Sylwia F 24.0 Poland POL 2020 Summer Tokyo Archery Women's Individual NaN Poland NaN NaN 0 0 0
241722 ZYZANSKA Sylwia F 24.0 Poland POL 2020 Summer Tokyo Archery Mixed Team NaN Poland NaN NaN 0 0 0

239958 rows × 17 columns

In [25]:
athlete_df = df.drop_duplicates(subset=['Name','NOC'])
In [26]:
athlete_df['Medal'].fillna('No Medal',inplace=True)
In [27]:
Sex_Count = athlete_df['Sex'].value_counts().reset_index()
Sex_Count.columns = ['Sex','Count']
fig = px.bar(Sex_Count, x='Sex', y='Count', color='Sex', 
             title="Participating Male Female Distribution",
             text = 'Count')
fig.update_layout(title_x=0.5)
fig.show()

Interpretition: There are more male athletes than the female athletes participated in the olympic games history. Specifically 94,590 and 34677 males and females respectively till 2020 edition of olympic.

In [28]:
athlete_df = athlete_df.dropna(subset=['Age'])
In [29]:
plt.hist(athlete_df['Age'],bins=5,color='skyblue')
Out[29]:
(array([9.3913e+04, 2.7607e+04, 1.4640e+03, 1.2300e+02, 5.0000e+00]),
 array([10. , 27.4, 44.8, 62.2, 79.6, 97. ]),
 <BarContainer object of 5 artists>)
No description has been provided for this image

Interpretation: The histogram shows that most athletes are concentrated in the younger age group, particularly between 10 and 30 years old. The number of athletes declines sharply after age 30, with very few participating beyond 60. This suggests that competitive athletic participation is heavily skewed toward younger individuals.

In [30]:
plt.figure(figsize=(10,5))

sns.boxplot(x="Medal", y="Age", data=df, order=['Gold', 'Silver', 'Bronze'], palette="coolwarm")
plt.title("Age Distribution of Medalists")
plt.xlabel("Medal Type")
plt.ylabel("Age")
plt.show()
No description has been provided for this image

Interpretation: The boxplot shows that the median age of medalists is similar across all medal types Gold, Silver and Bronze ranging around the mid-20s. Most medal winners are between 20 and 30 years old, with a few outliers extending into the 40s and beyond. This indicates that top performance tends to occur during early adulthood.

In [32]:
x1 = athlete_df['Age'].dropna().tolist()
x2 = athlete_df[athlete_df['Medal'] == 'Gold']['Age'].dropna().tolist()
x3 = athlete_df[athlete_df['Medal'] == 'Silver']['Age'].dropna().tolist()
x4 = athlete_df[athlete_df['Medal'] == 'Bronze']['Age'].dropna().tolist()
In [33]:
fig = ff.create_distplot([x1,x2,x3,x4],['Overall Age', 'Gold Medalist', 'Silver Medalist', 'Bronze Medalist'],show_hist=False,show_rug=False)
fig.update_layout(
    title= "Age Analysis of Olympic Medalists Across Different Categories",
    title_x=0.45, 
    title_y=0.85)
fig.show()

Interpretation: The distribution plot shows that most Olympic medalists, regardless of medal type, tend to be in their early to mid-20s, with peak density around age 24. Gold medalists have a slightly sharper peak, suggesting a tighter age range for top performance. Overall, age trends are quite similar across all medal categories.

In [34]:
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
sns.histplot(athlete_df['Height'],bins=30,kde=True,color='blue')
plt.title('Height Distribution of athletes')

plt.subplot(1,2,2)
sns.histplot(athlete_df['Weight'],bins=30,kde=True,color='green')
plt.title('Weight Distribution of athletes')

plt.show()
No description has been provided for this image

Interpretation: The height distribution of athletes is roughly normal, with most athletes falling between 165 cm and 185 cm. The weight distribution is slightly right-skewed, indicating a higher concentration around 65–75 kg but with a longer tail toward heavier weights. Both distributions show typical ranges for elite athletes.

In [35]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=athlete_df, x="Weight", y="Height",hue='Sex',alpha=0.5)
plt.title("Height vs. Weight of Athletes")
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.show()
No description has been provided for this image

Interpretation: The scatter plot shows a positive correlation between height and weight among athletes, with males generally being taller and heavier than females. There's a clear clustering, with male athletes occupying the upper-right region and female athletes the lower-left, reflecting typical physiological differences.

In [36]:
plt.figure(figsize=(12,6))

plt.subplot(1,2,1)
sns.boxplot(athlete_df,x='Sex',y='Height',palette='pastel')
plt.title('Height Distribution by Genders')

plt.subplot(1,2,2)
sns.boxplot(athlete_df,x='Sex',y='Weight',palette='muted')
plt.title('Weight Distribution by Genders')

plt.show()
No description has been provided for this image

Interpretation: The box plots show that male athletes generally have greater height and weight compared to female athletes. Both distributions exhibit some outliers, but the median height and weight are clearly higher for males, reflecting typical physiological differences between the genders.

In [37]:
medal_order = ['Gold', 'Silver', 'Bronze']

medalists_df = athlete_df[athlete_df['Medal'].isin(medal_order)]
medalists_df['Medal'] = pd.Categorical(medalists_df['Medal'], categories=medal_order, ordered=True)

fig = make_subplots(rows=1, cols=2, subplot_titles=("Height Distribution of Medalist", "Weight Distribution of Medalist"))

medal_colors = {'Gold': 'gold', 'Silver': 'silver', 'Bronze': 'brown'}

for medal in medal_order:
    temp_df = medalists_df[medalists_df['Medal'] == medal]
    fig.add_trace(go.Box(y=temp_df["Height"], name=f"{medal} Medal", marker_color=medal_colors[medal]), row=1, col=1)

for medal in medal_order:
    temp_df = medalists_df[medalists_df['Medal'] == medal]
    fig.add_trace(go.Box(y=temp_df["Weight"], name=f"{medal} Medal", marker_color=medal_colors[medal]), row=1, col=2)

fig.update_layout(title_text="Height & Weight Distributions of Medalists (Ordered by Medal Type)",title_x=0.5,showlegend=True)

fig.show()

Interpretation: The box plots show that gold medalists tend to be slightly taller and heavier on average compared to silver and bronze medalists. While the distributions overlap, gold medalists have a higher median in both height and weight, suggesting a potential physical advantage in certain sports.

In [38]:
plt.figure(figsize=(8,8))
selected_sport = 'Weightlifting'
temp_df = athlete_df[athlete_df['Sport']== selected_sport]
sns.scatterplot(temp_df,x='Weight',y='Height',hue='Medal',style='Sex',s=100)
plt.title(f'Height vs Weight Distribution in {selected_sport}', fontsize=14)
Out[38]:
Text(0.5, 1.0, 'Height vs Weight Distribution in Weightlifting')
No description has been provided for this image

Interpretation: In weightlifting, medalists (especially gold and bronze) are spread across a range of weights, but tend to cluster in the mid-to-high weight categories. Both male and female athletes compete across the height-weight spectrum, though males (marked with "x") dominate the higher weight and height ranges.

In [39]:
fig = px.scatter(df, x="Weight", y="Height", color="Sport",
                 title="Height vs Weight of Athletes by Sports",
                 hover_data=['Name', 'Sex', 'Medal'])
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The scatter plot shows distinct clusters of height and weight distributions by sport. Sports like basketball and volleyball have taller and heavier athletes, while gymnastics and badminton tend to have shorter and lighter participants. This reflects how physical attributes vary significantly based on the demands of each sport.

In [40]:
df.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index().head(10)
Out[40]:
NOC Gold Silver Bronze
0 USA 2587 1445 1277
1 URS 832 636 597
2 GBR 679 778 679
3 GER 619 569 689
4 ITA 546 499 498
5 FRA 528 611 624
6 CHN 480 438 328
7 HUN 442 336 402
8 AUS 378 480 583
9 SWE 362 426 358
In [41]:
df[(df['NOC']== 'IND') & (df['Medal']== 'Gold')]
Out[41]:
Name Sex Age Team NOC Year Season City Sport Event Medal Country Height Weight Bronze Gold Silver
4239 Shaukat Ali M 30.0 India IND 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
4243 Syed Mushtaq Ali M 22.0 India IND 1964 Summer Tokyo Hockey Hockey Men's Hockey Gold India 165.0 61.0 0 1 0
4513 Richard James Allen M 25.0 India IND 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
4514 Richard James Allen M 30.0 India IND 1932 Summer Los Angeles Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
4515 Richard James Allen M 34.0 India IND 1936 Summer Berlin Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
196614 Carlyle Carrol Tapsell M 23.0 India IND 1932 Summer Los Angeles Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
196615 Carlyle Carrol Tapsell M 27.0 India IND 1936 Summer Berlin Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
208715 Max "Maxie" Vaz M NaN India IND 1948 Summer London Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
222666 Sayed Muhammad Yusuf M NaN India IND 1928 Summer Amsterdam Hockey Hockey Men's Hockey Gold India NaN NaN 0 1 0
228552 CHOPRA Neeraj M 23.0 India IND 2020 Summer Tokyo Athletics Men's Javelin Throw Gold India NaN NaN 0 1 0

136 rows × 17 columns

In [42]:
medal_tally = df.drop_duplicates(subset=['Country','NOC','Year','Season','City','Sport','Event','Medal'])
In [43]:
medal_tally.shape
Out[43]:
(112833, 17)
In [44]:
medal_tally = medal_tally.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()
In [45]:
medal_tally[medal_tally['NOC']=='IND']
Out[45]:
NOC Gold Silver Bronze
52 IND 10 9 16
In [46]:
medal_tally['Total']=medal_tally['Gold'] + medal_tally['Silver'] + medal_tally['Bronze']
In [47]:
medal_tally
Out[47]:
NOC Gold Silver Bronze Total
0 USA 1074 843 740 2657
1 URS 394 317 294 1005
2 GBR 300 337 320 957
3 CHN 265 194 171 630
4 FRA 243 267 293 803
... ... ... ... ... ...
228 AHO 0 1 0 1
229 LBN 0 0 0 0
230 LBR 0 0 0 0
231 LCA 0 0 0 0
232 LBA 0 0 0 0

233 rows × 5 columns

Interpretation: Initially, the medal tally was inflated due to duplicate entries from team sports, where each team member received a medal. To correct this, duplicates were dropped based on key event identifiers, ensuring each medal is counted only once per event. This results in a more accurate representation of each country's medal count.

In [48]:
Year = df['Year'].unique().tolist()
In [49]:
Year.sort()
In [50]:
Year.insert(0,'Overall')
In [51]:
Country = df['Country'].unique().tolist()
In [52]:
Country.sort()
In [53]:
Country.insert(0,'Overall')
In [54]:
def fetch_medal_tally(df,Year,Country):
    medal_df = df.drop_duplicates(subset=['Country','NOC','Year','Season','City','Sport','Event','Medal'])    
    flag = 0
    if Year == 'Overall' and Country =='Overall':
        temp_df = medal_df
    if Year == 'Overall' and Country !='Overall':
        flag = 1
        temp_df = medal_df[medal_df['Country'] == Country]
        
    if Year != 'Overall' and Country == 'Overall':
        temp_df = medal_df[medal_df['Year'] == int(Year)]
        
    if Year != 'Overall' and Country !='Overall':
        temp_df = medal_df[(medal_df['Year'] == int(Year)) & (medal_df['Country'] == Country)]
    
    if flag ==1: 
        x= temp_df.groupby('Year').sum()[['Gold','Silver','Bronze']].sort_values('Year').reset_index()
    else:
        x= temp_df.groupby('NOC').sum()[['Gold','Silver','Bronze']].sort_values('Gold',ascending=False).reset_index()

    x['Total'] = x['Gold'] + x['Silver'] + x['Bronze']
    
    print(x)
In [55]:
fetch_medal_tally(df,Year='Overall', Country = 'Overall')
     NOC  Gold  Silver  Bronze  Total
0    USA  1074     843     740   2657
1    URS   394     317     294   1005
2    GBR   300     337     320    957
3    CHN   265     194     171    630
4    FRA   243     267     293    803
..   ...   ...     ...     ...    ...
228  AHO     0       1       0      1
229  LBN     0       0       0      0
230  LBR     0       0       0      0
231  LCA     0       0       0      0
232  LBA     0       0       0      0

[233 rows x 5 columns]

Interpretation: This function fetch_medal_tally() is designed to return a customized medal tally based on the selected year and country. It first removes duplicate medals using a unique combination of event-related columns to avoid overcounting, especially for team sports. Depending on the inputs, it can show the overall medal tally by country, by year, or the performance of a specific country across years. The flag variable helps determine whether to group the result by Year (for single country) or by NOC (for overall view). The final output includes the total count of Gold, Silver, and Bronze medals, along with the overall total.

In [56]:
medal_df = df.drop_duplicates(subset=['Name','Sex','NOC','Team','Country','Sport','Event','Medal','Year','Season'],inplace=True)
In [57]:
def most_successful(medal_df,Sport='Overall'):
    temp_df = medal_df.dropna(subset=['Medal'])
    
    if Sport != 'Overall':
        temp_df = temp_df[temp_df['Sport'] == Sport]
        
    x = temp_df['Name'].value_counts().reset_index().head(15).merge(df,on='Name',how='left')[['Name','count','Sport','Country']].drop_duplicates('Name')
    x.rename(columns={'count':'Medals'},inplace=True)
    
    return x
In [58]:
 most_successful(df,'Overall')
Out[58]:
Name Medals Sport Country
0 Michael Fred Phelps, II 28 Swimming United States
30 Larysa Semenivna Latynina (Diriy-) 18 Gymnastics Soviet Union
49 Nikolay Yefimovich Andrianov 15 Gymnastics Soviet Union
73 Takashi Ono 13 Gymnastics Japan
106 Edoardo Mangiarotti 13 Fencing Italy
120 Borys Anfiyanovych Shakhlin 13 Gymnastics Soviet Union
144 Sawao Kato 12 Gymnastics Japan
168 Aleksey Yuryevich Nemov 12 Gymnastics Russian Federation
189 Dara Grace Torres (-Hoffman, -Minas) 12 Swimming United States
202 Jennifer Elisabeth "Jenny" Thompson (-Cumpelik) 12 Swimming United States
219 Birgit Fischer-Schmidt 12 Canoeing East Germany
232 Ryan Steven Lochte 12 Swimming United States
246 Paavo Johannes Nurmi 12 Athletics Finland
258 Natalie Anne Coughlin (-Hall) 12 Swimming United States
270 Viktor Ivanovych Chukarin 11 Gymnastics Soviet Union

Interpretation: The most_successful() function identifies the top 15 most decorated athletes based on total medals won. It first filters out any entries without a medal and optionally narrows the data down to a specific sport if provided. The athletes are ranked by how many times their names appear in medal-winning records. It then merges with the original dataframe to extract corresponding sport and country information, ensuring each athlete appears only once. This gives a concise summary of the most successful athletes overall or within a chosen sport.

In [59]:
def most_successful(df,Country):
    temp_df = df.dropna(subset=['Medal'])
    
    temp_df = temp_df[temp_df['Country'] == Country]
        
    x = temp_df['Name'].value_counts().reset_index().head(15).merge(df,on='Name',how='left')[['Name','count','Sport']].drop_duplicates('Name')
    x.rename(columns={'count':'Medals'},inplace=True)
    
    return x
In [60]:
most_successful(df,'United States')
Out[60]:
Name Medals Sport
0 Michael Fred Phelps, II 28 Swimming
30 Jennifer Elisabeth "Jenny" Thompson (-Cumpelik) 12 Swimming
47 Ryan Steven Lochte 12 Swimming
61 Dara Grace Torres (-Hoffman, -Minas) 12 Swimming
74 Natalie Anne Coughlin (-Hall) 12 Swimming
86 Mark Andrew Spitz 11 Swimming
98 Matthew Nicholas "Matt" Biondi 11 Swimming
110 Carl Townsend Osburn 11 Shooting
126 Frederick Carlton "Carl" Lewis 10 Athletics
136 Gary Wayne Hall, Jr. 10 Swimming
146 Raymond Clarence "Ray" Ewry 10 Athletics
156 Shirley Frances Babashoff 9 Swimming
167 Martin Joseph Sheridan 9 Athletics
182 Allyson Michelle Felix 9 Athletics
192 Donald Arthur "Don" Schollander 8 Swimming

Interpretation: The most_successful() function here is tailored to identify the top 15 medal-winning athletes from a specific country—in this case, the United States. It filters out non-medal entries and focuses only on the athletes from the chosen country. Then, it ranks the athletes by the number of medals won, merges to include their sport, and removes duplicates to avoid repetition. The result highlights the most successful U.S. athletes along with their medal counts and primary sports.

In [61]:
temp_df = df.dropna(subset=['Medal'])
temp_df.drop_duplicates(subset=['NOC','Team','Sport','Event','Medal','Year','City'],inplace=True)
In [62]:
selected_country = 'United States'
new_df = temp_df[temp_df['Country']== selected_country]
final_df = new_df.groupby('Year').count()['Medal'].reset_index()

fig = px.line(final_df,x = 'Year',y='Medal',markers=True)
fig.update_layout(title=f"Medal Distribution by Sport and Year for {selected_country} in the Olympics",
                 title_x=0.5,)
fig.show()

Interpretation: This line chart visualizes the trend of Olympic medals won by the United States over the years. There are noticeable peaks, such as around 1904 and 1984, likely reflecting strong performances or hosting years. Overall, the U.S. has maintained a relatively consistent medal count in recent decades, highlighting its sustained competitiveness in the Olympics.

In [63]:
selected_country = 'United States'
new_df = temp_df[temp_df['Country'] == selected_country]
plt.figure(figsize=(20,20))
sns.heatmap(new_df.pivot_table(index='Sport',columns='Year',values='Medal',aggfunc='count').fillna(0).astype(int),annot=True)
plt.title(f"Medal Distribution by Sport and Year for {selected_country} in the Olympics",fontsize=16,fontweight="bold")
Out[63]:
Text(0.5, 1.0, 'Medal Distribution by Sport and Year for United States in the Olympics')
No description has been provided for this image

Interpretation: This heatmap illustrates the distribution of medals won by the United States across different sports and Olympic years. It highlights that Athletics and Swimming have been the most dominant sports for the U.S., consistently yielding high medal counts. Sporadic participation and performance are observed in niche or newer sports, indicated by sparser entries.

Overall Analysis - Top Statistics¶

• No of Editions
• No of Cities
• No of Sports/Events
• No of Athletes
• Participating Nations

In [66]:
# No of Editions
df['Year'].unique().shape[0]
Out[66]:
30

Interpretation: This code returns the total number of unique Olympic years present in the dataset. This helps understand the temporal coverage of the data for trend or time-series analysis.

In [67]:
# Names of the Cities
df['City'].unique()
Out[67]:
array(['Barcelona', 'London', 'Antwerpen', 'Paris', 'Los Angeles',
       'Helsinki', 'Sydney', 'Atlanta', 'Stockholm', 'Beijing',
       'Rio de Janeiro', 'Athina', 'Mexico City', 'Munich', 'Seoul',
       'Berlin', 'Melbourne', 'Roma', 'Amsterdam', 'Montreal', 'Moskva',
       'Tokyo', 'St. Louis'], dtype=object)
In [68]:
df['City'].unique().shape
Out[68]:
(23,)

Interpretation: This code returns an array of all unique host cities where the Olympics were held in the dataset. This helps identify the geographic diversity and locations of past Olympic events.

In [69]:
city_counts = df['City'].value_counts().reset_index()
top_cities = city_counts.head(10)
fig = px.pie(top_cities, values='count', names='City', 
             title="Top 10 Most Frequent Olympic Host Cities",
             color_discrete_sequence = px.colors.qualitative.Dark24 ,
             hole=0.4) 
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: This donut chart visualizes the top 10 most frequent Olympic host cities, with Tokyo hosting the most events (15%), followed by London (14.6%) and Athina (10.2%). The chart highlights how certain cities have played a recurring and significant role in hosting the Olympic Games over time.

In [70]:
df['Sport'].unique()
Out[70]:
array(['Basketball', 'Judo', 'Football', 'Tug-Of-War', 'Athletics',
       'Swimming', 'Badminton', 'Sailing', 'Gymnastics',
       'Art Competitions', 'Handball', 'Weightlifting', 'Wrestling',
       'Water Polo', 'Hockey', 'Rowing', 'Fencing', 'Equestrianism',
       'Shooting', 'Boxing', 'Taekwondo', 'Cycling', 'Diving', 'Canoeing',
       'Tennis', 'Modern Pentathlon', 'Golf', 'Softball', 'Archery',
       'Volleyball', 'Synchronized Swimming', 'Table Tennis', 'Baseball',
       'Rhythmic Gymnastics', 'Rugby Sevens', 'Trampolining',
       'Beach Volleyball', 'Triathlon', 'Rugby', 'Lacrosse', 'Polo',
       'Cricket', 'Ice Hockey', 'Racquets', 'Motorboating', 'Croquet',
       'Figure Skating', 'Jeu De Paume', 'Roque', 'Basque Pelota',
       'Alpinism', 'Aeronautics', 'Cycling Road', 'Artistic Gymnastics',
       'Karate', 'Baseball/Softball', 'Trampoline Gymnastics',
       'Marathon Swimming', 'Canoe Slalom', 'Surfing', 'Canoe Sprint',
       'Cycling BMX Racing', 'Equestrian', 'Artistic Swimming',
       'Cycling Track', 'Skateboarding', 'Cycling Mountain Bike',
       '3x3 Basketball', 'Cycling BMX Freestyle', 'Sport Climbing'],
      dtype=object)
In [71]:
df['Sport'].unique().shape
Out[71]:
(70,)

Interpretation: This function retrieves all the unique sports represented in the dataset. This provides an overview of the diversity and range of disciplines included in the Olympic Games over the years.

In [72]:
sport_counts = df['Sport'].value_counts()
top_sports = sport_counts.head(10).reset_index()
top_sports.columns = ['Sport', 'Count']

fig = px.bar(top_sports, x='Sport', y='Count', 
             title="Top 10 Most Popular Olympic Sports",
             color='Count', color_continuous_scale='viridis')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The bar chart shows that Athletics, Gymnastics, and Swimming are the most popular Olympic sports based on participation count, with Athletics significantly leading. These sports likely offer more events and opportunities, contributing to their higher frequency in the dataset.

In [73]:
sport_over_time = df.groupby('Year')['Sport'].nunique().reset_index()
fig = px.line(sport_over_time, x='Year', y='Sport',
              title="Number of Unique Sports Over Time",
              markers=True)
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The number of unique sports in the Olympics has grown steadily from 9 in 1896 to 46 in 2020, highlighting the expansion and modernization of the Games. Milestones like 1960, 2016, and 2020 mark significant boosts in diversity, reflecting efforts to include more disciplines and appeal to broader global interests.

In [74]:
city_sports = df.groupby('City')['Sport'].nunique().reset_index()

fig = px.bar(city_sports.sort_values(by='Sport', ascending=False),
             x='City', y='Sport', 
             title="Number of Unique Sports Hosted by Each City",
             color='Sport', color_continuous_scale='plasma')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: Among host cities, Tokyo stands out for having hosted the highest number of unique sports, followed by London and Athina, highlighting their major roles in Olympic history and infrastructure capacity.

In [75]:
medalist_events = df[df['Medal'].notna()]['Sport'].value_counts().head(10)

fig = px.bar(medalist_events, x=medalist_events.index, y=medalist_events.values,
             title="Top 10 Sports with Most Medalists",
             labels={'x': 'Sport', 'y': 'Medalists'},
             color=medalist_events.values, color_continuous_scale='magma')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: Athletics leads as the sport with the most medalists (4,199), reflecting its wide range of events and global participation. Swimming follows closely with 3,256 medalists, showcasing its competitive popularity. Rowing ranks third with 3,089 medalists, indicating its strong Olympic presence despite being less mainstream.

In [76]:
df['Event'].unique().shape
Out[76]:
(957,)

Interpretation: This function retrieves all the unique events represented in the dataset. It offers insight into the variety and scope of competitions held during the Olympic Games, showcasing the range of disciplines and categories across different sports.

In [77]:
event_counts = df['Event'].value_counts()
top_events = event_counts.head(10).reset_index()
top_events.columns = ['Event', 'Count']

fig = px.bar(top_events, x='Event', y='Count', 
             title="Top 10 Most Popular Olympic Events",
             color='Count', color_continuous_scale='viridis')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The bar chart shows that Football Men's Football, Hockey Men's Hockey and Water Polo Men's Water Polo are the most popular Olympic events based on participation count, with Football Men's Football significantly leading. These events likely occur more frequently due to their broader representation across multiple editions of the Olympic Games, contributing to their higher occurrence in the dataset.

In [78]:
events_over_time = df.drop_duplicates(['Year','Event'])['Year'].value_counts().reset_index().sort_values('Year')
events_over_time.rename(columns={'Year':'Edition','count':'No of Events'},inplace=True)
In [79]:
fig = px.line(events_over_time,x = 'Edition',y='No of Events',markers=True)
fig.update_layout(
    title="Events over the years",
    title_x=0.5, 
    title_y=0.95)
fig.show()

Interpretation: The number of Olympic events has steadily increased from 43 in 1896 to over 300 in recent editions, reaching 306 in 2020. This growth reflects the expansion of the Games to include more sports, disciplines, and gender-inclusive categories over time.

In [80]:
x = df.drop_duplicates(['Year','Sport','Event'])

plt.figure(figsize=(15,15))
sns.heatmap(x.pivot_table(index='Sport',columns='Year',values='Event',aggfunc='count').fillna(0).astype(int),annot=True,cmap='crest')
plt.title("Number of Events per Sport Over the Years (Olympics)", fontsize=14, fontweight="bold", pad=15)
Out[80]:
Text(0.5, 1.0, 'Number of Events per Sport Over the Years (Olympics)')
No description has been provided for this image
In [81]:
events_per_sport = df.groupby('Sport')['Event'].nunique().reset_index()

fig = px.bar(events_per_sport.sort_values(by='Event', ascending=False),
             x='Sport', y='Event', 
             title="Number of Events per Sport",
             color='Event', color_continuous_scale='Rainbow')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: Athletics, shooting, and swimming have the highest number of unique events, with 136, 98, and 90 respectively. This indicates these sports have a wide range of disciplines and categories, offering more medal opportunities and highlighting their prominence in the Olympic Games.

In [82]:
city_events = df.groupby('City')['Event'].nunique().reset_index()

fig = px.bar(city_events.sort_values(by='Event', ascending=False),
             x='City', y='Event', 
             title="Number of Unique Events Hosted by Each City",
             color='Event', color_continuous_scale='Jet')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: Tokyo, London, and Athina have hosted the highest number of unique events, with 469, 398, and 357 respectively. This reflects their roles as major Olympic hosts across multiple editions, showcasing their capacity to support a diverse and extensive range of competitions.

In [83]:
medalist_events = df[df['Medal'].notna()]['Event'].value_counts().head(10)

fig = px.bar(medalist_events, x=medalist_events.index, y=medalist_events.values,
             title="Top 10 Events with Most Medalists",
             labels={'x': 'Event', 'y': 'Medalists'},
             color=medalist_events.values, color_continuous_scale='magma')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: Men's football, hockey, and water polo are the top events with the most medalists, having 1269, 1048, and 866 respectively. These team-based sports naturally involve a larger number of athletes per event, which contributes to their high medalist counts in the dataset.

In [84]:
df['Name'].unique().shape
Out[84]:
(127751,)

Interpretation: The function returns the number of unique athlete names in the dataset. This indicates the total number of distinct athletes who have participated in the Olympic Games over the recorded years.

In [85]:
athletes_over_time = df.drop_duplicates(['Year','Name'])['Year'].value_counts().reset_index().sort_values('Year')
athletes_over_time.rename(columns={'Year':'Edition','count':'No of Athletes'},inplace=True)
In [86]:
fig = px.line(athletes_over_time,x = 'Edition',y='No of Athletes',markers=True)
fig.update_layout(
    title="Athletes Over the Years",
    title_x=0.5, 
    title_y=0.95)
fig.show()

Interpretation: The number of athletes has shown significant growth over the years, with notable spikes in 1908, 1936,1960, 1972, and especially post-1980. Since 2000, the increase has been more gradual and steady, indicating a plateau in athlete participation at the modern Olympic Games.

In [87]:
athlete_gender = df.groupby(['Year', 'Sex'])['Name'].nunique().reset_index()

fig = px.line(athlete_gender, x='Year', y='Name', color='Sex',markers=True,
              title="Men vs Women Athlete Participation Over the Years")
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The participation of women in the Olympics has shown a consistent upward trend, rising from just 23 in early editions to 5,556 in 2020, gradually closing the gap with male athletes. This reflects the growing emphasis on gender equality and inclusion in sports over the decades.

In [88]:
sport_athlete_counts = df.groupby("Sport")["Name"].nunique().sort_values(ascending=False).head(10)
fig = px.bar(sport_athlete_counts, x=sport_athlete_counts.index, y=sport_athlete_counts.values, 
             title="Top 10 Sports with Most Unique Athletes",
             labels={"x": "Sport", "y": "Unique Athletes"},
             color=sport_athlete_counts.values,
             color_continuous_scale="Turbo")
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: Athletics, swimming, and rowing have the highest number of unique athletes, with 24.1k, 9.6k, and 8.2k respectively. This highlights their popularity and broad participation across different Olympic editions, especially in sports like athletics which offer numerous event categories.

In [89]:
common_names = df['Name'].str.split().explode().value_counts().head(10)
fig = px.bar(common_names, x=common_names.index, y=common_names.values,
             title="Most Common Athlete Names",
             labels={"x": "Athletes", "y": "Count"},
             color=common_names.values,
             color_continuous_scale="magma")
fig.update_layout(title_x=0.5)
fig.show()
In [90]:
df['Country'].unique().shape
Out[90]:
(232,)

Interpretation: The function returns the number of unique countries represented in the dataset. This reflects the global participation and diversity of nations competing in the Olympic Games over the years.

In [91]:
top_countries = df['Country'].value_counts().head(10)

fig = px.bar(top_countries,x=top_countries.values, y=top_countries.index,
       title="Top 10 Most Represented Countries in the Olympics",
       labels={"x": "No. of Entries", "y": "Country"},
       color=top_countries.values,
       color_continuous_scale='Turbo')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The United States, Great Britain, and France are the most represented countries in Olympic history, with 15.7k, 11.3k, and 11k entries respectively. This reflects their long-standing and consistent participation, as well as strong investment in a wide range of sports disciplines.

In [92]:
nations_over_time = df.drop_duplicates(['Year','Country'])['Year'].value_counts().reset_index().sort_values('Year')
nations_over_time.rename(columns={'Year':'Edition','count':'No of Countries'},inplace=True)
In [93]:
fig = px.line(nations_over_time,x = 'Edition',y='No of Countries',markers=True)
fig.update_layout(
    title="Participating Nations over the time",
    title_x=0.5, 
    title_y=0.95)
fig.show()

Interpretation: The number of participating nations has grown significantly from just 12 in 1896 to 206 in 2020, reflecting the increasing global reach and inclusivity of the Olympic Games. Key milestones include major expansions around 1972 and 1996, showcasing the Olympics' evolution into a truly worldwide event.

In [94]:
country_counts = df['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Count']

fig = px.treemap(country_counts, path=['Country'], values='Count',
                 title="Olympic Participation by Country",
                 color='Count', color_continuous_scale='viridis')
fig.update_layout(title_x=0.5)
fig.show()

Interpretation: The treemap visually emphasizes the dominance of countries like the United States, Great Britain, and France in Olympic participation, with the largest blocks representing their high entry counts. It also highlights the significant presence of other major nations such as Italy, Germany, and Australia, showcasing their consistent involvement across Olympic editions.